Greek Alphabet Recognition Technique for Biomedical Documents

Author

  • Daniel X. Le
Abstract

Most current commercial optical character recognition (OCR) systems can accurately recognize the text in documents written in a single language. However, these systems perform poorly on Greek characters embedded in predominantly English text, and most of them do not recognize the characters as belonging to the Greek alphabet at all. As a result, the degree of manual review required to validate and correct OCR errors is high. To handle this problem, we propose a new technique that combines features calculated from the output of multiple OCR systems with string pattern matching and document content analysis to improve the recognition of both Greek characters and regular text. The technique passes each document page image through OCR twice, using a different recognition language for each pass. A preliminary evaluation on a sample of medical journal page images shows that the technique is feasible and that it improves the recognition of Greek characters embedded within predominantly English-language text.
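The two-pass idea can be illustrated with a minimal sketch. The abstract does not specify how the two OCR outputs are combined, so everything here is an assumption made for illustration: the function name `merge_passes`, the word-aligned `(text, confidence)` token format, and the rule of preferring the Greek-language pass only when its token contains a Greek letter and has the higher confidence. It is not the author's method, and it does not call any real OCR engine.

```python
import re

# Assumption: a token "looks Greek" if it contains any code point
# from the Unicode Greek and Coptic block (U+0370 to U+03FF).
GREEK_RE = re.compile(r"[\u0370-\u03FF]")


def merge_passes(english_tokens, greek_tokens):
    """Merge word-aligned tokens from two hypothetical OCR passes.

    Each token is a (text, confidence) pair. The English-language pass
    is the default; the Greek-language pass wins for a word only when
    its text contains a Greek letter and its confidence is higher.
    """
    if len(english_tokens) != len(greek_tokens):
        raise ValueError("the two passes must be word-aligned")

    merged = []
    for (en_text, en_conf), (el_text, el_conf) in zip(english_tokens, greek_tokens):
        if GREEK_RE.search(el_text) and el_conf > en_conf:
            merged.append(el_text)
        else:
            merged.append(en_text)
    return merged


# Made-up tokens for illustration only, not data from the paper.
english_pass = [("TGF-B", 0.40), ("receptor", 0.95)]
greek_pass = [("TGF-β", 0.80), ("ρεcεptoρ", 0.30)]
print(merge_passes(english_pass, greek_pass))
```

In this toy input, the Greek pass is preferred for the first word and the English pass for the second. A fuller system along the lines the abstract describes would also apply string pattern matching and document content analysis before accepting a Greek reading.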

Similar resources

Extracting Conceptual Terms from Medical Documents

Automated biomedical concept recognition is important for biomedical document retrieval and text mining research. In this paper, we describe a two-step concept extraction technique for documents in the biomedical domain. Step one is noun phrase extraction, which automatically extracts noun phrases from medical documents. Extracted noun phrases are used as concept term candidates which beco...

Handwritten Character Recognition using Modified Gradient Descent Technique of Neural Networks and Representation of Conjugate Descent for Training Patterns

The purpose of this study is to analyze the performance of the backpropagation algorithm with changing training patterns and a second momentum term in feedforward neural networks. This analysis is conducted on 250 different words of three small letters from the English alphabet. These words are presented to two vertical segmentation programs which are designed in MATLAB and based on portions (1...

A Hidden Markov Model for Alphabet-Soup Word Recognition

Recent work on the “alphabet soup” paradigm has demonstrated effective segmentation-free character-based recognition of cursive handwritten historical text documents. The approach first uses a joint boosting technique to detect potential characters (the alphabet soup). A second stage uses a dynamic programming algorithm to recover the correct sequence of characters. Despite experimental success, ...

From the “ancestar” of turtles to the turtle-mouse: when Greek words are used for turtle taxon names

Methods. I critically analyzed the most common Greek words used as taxa names in the chelonian literature to establish their etymology and check whether the transliteration process has been done correctly. I also compared the current guidelines for the latinisation of Greek words recommended by the International Code of Zoological Nomenclature, with other alternative systems for the transformat...

The Development of the Greek Alphabet within the Chronology of the ANE

Andrew Cross, University of Calgary, November 29, 2009. The transition from a pictogram-based writing system to the alphabet transformed societies by bringing literacy to the masses. Pictogram-based writing systems such as Egyptian hieroglyphics used more than 400 signs to represent syllables or objects, many of which had mult...

Publication date: 2002